words
congruent
:identical in form; coinciding exactly
Abstract
The Paraphrase Database (PPDB) consists of a list of phrase pairs with heuristic confidence estimates. Moreover, its coverage is necessarily incomplete.
They propose models to:
- score paraphrase pairs more accurately than PPDB’s internal scores
- improve its coverage
They also introduce two new, manually annotated datasets to evaluate short-phrase paraphrasing models:
- Annotated-PPDB
- ML-Paraphrase
Introduction
Paraphrase detection
:the task of analyzing two segments of text and determining if they have the same meaning despite differences in structure and wording.
Drawbacks of PPDB:
- lack of coverage
- PPDB is a nonparametric paraphrase model: the number of parameters (phrase pairs) grows with the size of the dataset used to build it.
What this work does:
- they show that the initial skip-gram word vectors can be fine-tuned for the paraphrase task by training on word pairs from PPDB, producing what they call PARAGRAM word vectors.
- they show that the resulting word and phrase representations are effective on a wide variety of tasks
Contributions:
Provide new PARAGRAM word vectors
which improve performance in sentiment analysis and achieve state-of-the-art results on SimLex-999
Provide ways to use PPDB to embed phrases
Introduce two new datasets
New Paraphrase Datasets
Annotated-PPDB
Most existing paraphrase datasets focus on words, like SimLex-999,
or entire sentences, such as the Microsoft Research Paraphrase Corpus.
- filter phrases for quality
- filter by lexical overlap
- select a range of paraphrasabilities
- Annotate with Mechanical Turk
Finally, they selected 1260 phrase pairs from the 3000 annotations. These 1260 examples were then randomly split into a development set of 260 examples and a test set of 1000 examples.
ML-Paraphrase
The second newly-annotated dataset is based on the bigram similarity task. They found that the original annotations were not consistent with the notion of similarity central to paraphrase tasks. For instance, "television set" and "television programme" were the highest-rated pair in the NN section, and "older man" / "elderly woman" was among the highest-ranked JN pairs.
Paraphrase Models
The goal is to embed phrases into a low-dimensional space such that cosine similarity in the space corresponds to the strength of the paraphrase relationship between phrases. A recursive neural network (RNN) was used.
For phrase $p$, they compute its vector $g(p)$ through recursive computation on the parse. If $p$ is parent node and $c_1$ and $c_2$ are its child nodes:
$g(p)=f(W[g(c_1);g(c_2)]+b)$
Here $W$ is a composition weight matrix, not the word embeddings.
If $p$ does not have child nodes:
$g(p)=W_w^{(p)}$
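The recursive computation above can be sketched in a few lines; the toy vocabulary, dimension, and tanh nonlinearity below are illustrative assumptions, not the paper's released setup.

```python
import numpy as np

# Minimal sketch of the recursive composition g(p): a binary parse is
# represented as nested tuples, leaves are word-vector lookups, inner
# nodes apply f(W[g(c1); g(c2)] + b). Names and sizes are toy choices.
rng = np.random.default_rng(0)
d = 4                                            # toy embedding dimension
word_vecs = {w: rng.normal(size=d) for w in ["the", "old", "man"]}
W = rng.normal(size=(d, 2 * d)) * 0.1            # composition weight matrix
b = np.zeros(d)

def g(p):
    """Embed a phrase node: leaf -> embedding lookup, inner node ->
    f(W[g(c1); g(c2)] + b) with f = tanh."""
    if isinstance(p, str):                       # leaf: g(p) = W_w^{(p)}
        return word_vecs[p]
    c1, c2 = p
    return np.tanh(W @ np.concatenate([g(c1), g(c2)]) + b)

v = g(("the", ("old", "man")))   # right-branching parse of "the old man"
print(v.shape)                   # (4,)
```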
The objective function follows:
$$\min_{W,b,W_w}\frac{1}{|X|}(\sum_{<x_1,x_2>\in X} \max(0,\delta-g(x_1) \cdot g(x_2) + g(x_1) \cdot g(t_1)) \\ + \max(0,\delta-g(x_1) \cdot g(x_2) + g(x_2) \cdot g(t_2)))+ \\ \lambda_W(||W||^2+||b||^2)+\lambda_{W_w}||W_{w_{initial}}-W_w||^2$$
where $\delta$ is the margin (set to 1 in all of the experiments), and $t_1$ and $t_2$ are carefully-selected negative examples taken from a mini-batch during optimization. The intuition for this objective is that we want the two phrases to be more similar to each other ($g(x_1)\cdot g(x_2)$) than either is to its respective negative example ($t_1$ or $t_2$), by a margin of at least $\delta$.
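The hinge term for one pair can be written out directly; the vectors below are toy stand-ins for phrase embeddings.

```python
import numpy as np

# Margin loss for a single pair <x1, x2> with negatives t1, t2:
# push g(x1)·g(x2) above both negative similarities by at least delta.
def pair_loss(gx1, gx2, gt1, gt2, delta=1.0):
    pos = gx1 @ gx2
    return (max(0.0, delta - pos + gx1 @ gt1)
            + max(0.0, delta - pos + gx2 @ gt2))

# Toy embeddings: x1 and x2 nearly parallel, negatives nearly orthogonal.
x1, x2 = np.array([1.0, 0.0]), np.array([0.9, 0.1])
t1, t2 = np.array([0.0, 1.0]), np.array([0.1, 0.9])
print(pair_loss(x1, x2, t1, t2))
```

A loss of zero would mean both hinge constraints are already satisfied by the margin.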
### Selecting Negative Examples ###
To select $t_1$ and $t_2$, the most similar phrase in the mini-batch is chosen.
$t_1 = \mathrm{argmax}_{t:<t,\cdot> \in X_b \setminus \{<x_1,x_2>\}} g(x_1) \cdot g(t)$
where $X_b \subseteq X$ is the current mini-batch.
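The selection rule amounts to an argmax over the mini-batch; a small sketch, assuming the candidate embeddings (phrases from other pairs in the batch) are already computed:

```python
import numpy as np

# Pick t1 for a pair <x1, x2>: among candidate phrases t drawn from the
# mini-batch (excluding x1 and x2 themselves), choose the one whose
# embedding is most similar to g(x1) by inner product.
def pick_negative(g_x1, batch_embeddings):
    sims = [g_x1 @ g_t for g_t in batch_embeddings]
    return int(np.argmax(sims))

g_x1 = np.array([1.0, 0.0])
candidates = [np.array([0.0, 1.0]),
              np.array([0.8, 0.2]),     # most similar to g_x1
              np.array([-1.0, 0.0])]
print(pick_negative(g_x1, candidates))  # index of the most similar candidate
```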
### Training Word Paraphrase Models ###
To train just word vectors on word paraphrase pairs:
$$\min_{W_w}\frac{1}{|X|}(\sum_{<x_1,x_2>\in X} \max(0,\delta-W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_1)} \cdot W_w^{(t_1)}) \\ + \max(0,\delta-W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_2)} \cdot W_w^{(t_2)})) + \\ \lambda_{W_w}||W_{w_{initial}}-W_w||^2$$
Experiments-Word Paraphrasing
Training Procedure
They did a coarse grid search over a parameter space for $\lambda_{W_w}$ and the mini-batch size. They trained for 20 epochs for each set of hyperparameters using AdaGrad.
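AdaGrad scales each parameter's step by the history of its squared gradients. A minimal sketch on a toy objective (the learning rate and objective are illustrative, not the paper's values):

```python
import numpy as np

# One AdaGrad update: per-coordinate step size shrinks as the
# accumulated squared gradient for that coordinate grows.
def adagrad_step(theta, grad, hist, lr=0.5, eps=1e-8):
    hist += grad ** 2
    theta -= lr * grad / (np.sqrt(hist) + eps)
    return theta, hist

theta = np.array([5.0, -3.0])
hist = np.zeros(2)
for _ in range(100):             # minimize f(theta) = ||theta||^2
    grad = 2.0 * theta
    theta, hist = adagrad_step(theta, grad, hist)
print(np.round(theta, 3))        # both coordinates shrink toward 0
```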
Tuning and Evaluation
They tuned to maximize $2\times$ the WS-S correlation minus the WS-R correlation. The idea was to reward vectors with high similarity and relatively low relatedness, in order to target the paraphrase relationship. They chose SL999 as their primary test set as it most closely evaluates the paraphrase relationship. Note that for all experiments they used cosine similarity as their similarity metric and evaluated the statistical significance of differences between dependent correlations using the one-tailed method of (Steiger, 1980).
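The tuning criterion is easy to compute once model similarities and gold scores are in hand. A sketch using a NumPy rank-based Spearman (valid when there are no tied scores); the gold scores and model similarities below are toy stand-ins for WS-S/WS-R data:

```python
import numpy as np

# Spearman's rho as the Pearson correlation of ranks (no-ties case).
def spearman(a, b):
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

# Tuning criterion: 2 * (WS-S correlation) - (WS-R correlation).
def criterion(sims_s, gold_s, sims_r, gold_r):
    return 2 * spearman(sims_s, gold_s) - spearman(sims_r, gold_r)

gold_s = [9.0, 7.5, 3.0, 1.0]   # toy WS-S human scores
sims_s = [0.9, 0.8, 0.3, 0.1]   # model cosine scores, well-correlated
gold_r = [8.0, 6.0, 4.0, 2.0]   # toy WS-R human scores
sims_r = [0.2, 0.5, 0.4, 0.3]   # model cosine scores, weakly correlated
print(criterion(sims_s, gold_s, sims_r, gold_r))
```

High WS-S correlation raises the score while high WS-R correlation lowers it, matching the stated goal of targeting similarity over relatedness.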
One question here: since WS-S is used as a tuning set, is it guaranteed to have no overlap with SL999?
Experiments-Compositional Paraphrasing
They used a support vector regression model ($\epsilon$-SVR) on the 33 features that are included for each phrase pair in PPDB. The parameters were tuned using 5-fold cross-validation on the dev set. Then, after finding the best-performing $C$ and $\epsilon$ combination, the model was trained on the entire dev set and evaluated on the test set of Annotated-PPDB.
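This regression setup maps directly onto scikit-learn; a minimal sketch in which the random features and labels stand in for PPDB's 33 per-pair features and the human annotations (the RBF kernel and grid values are assumptions):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Toy stand-ins: 260 dev pairs and 1000 test pairs, 33 features each,
# matching the Annotated-PPDB split sizes described above.
rng = np.random.default_rng(0)
X_dev = rng.normal(size=(260, 33))
y_dev = rng.uniform(1, 5, size=260)      # toy human paraphrase scores
X_test = rng.normal(size=(1000, 33))

# 5-fold CV grid search over C and epsilon; refit=True (the default)
# retrains the best model on the entire dev set afterwards.
grid = GridSearchCV(
    SVR(kernel="rbf"),
    {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1, 0.5]},
    cv=5,
)
grid.fit(X_dev, y_dev)
preds = grid.predict(X_test)             # scores for the 1000 test pairs
print(sorted(grid.best_params_), preds.shape)
```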